Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_15

15. Linear Mixed-Effects Models for Longitudinal Microbiome Data

Yinglin Xia¹ and Jun Sun ¹

(1)

Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

Longitudinal microbiome data analysis can be categorized into two approaches: univariate longitudinal analysis and multivariate longitudinal analysis. Univariate analysis analyzes the change of one taxon or alpha diversity over time. Multivariate analysis directly analyzes the change of multiple taxa simultaneously or distance/dissimilarity (beta diversities) over time. This chapter introduces using the classical univariate linear mixed-effects models (LMMs) to analyze microbiome data. First, it describes linear mixed-effects models (LMMs), and then it introduces implementation of LMMs to identify the significant taxa using the nlme package, and model the diversity indices using the lme4 and LmerTest packages, respectively. Next, it introduces how to fit LMMs using QIIME 2 and provides some remarks on longitudinal microbiome studies based on LMMs.

Keywords

Linear mixed-effects models (LMMs) Fixed effect Random effect Hypothesis tests nlme package lme4 package LmerTest package Maximum likelihood (ML) AIC BIC Restricted maximum likelihood (REML) Generalized linear model (GLM) Autoregressive of order 1 [AR(1)] KRmodcomp() pbkrtest qiime longitudinal linear-mixed-effects Volatility analysis

We can categorize longitudinal microbiome data analysis into univariate and multivariate approaches. Univariate analysis analyzes one taxon per time or first summarizes microbiome abundances into alpha diversities and then analyzes these diversities one by one. Multivariate analysis either analyzes multiple taxa simultaneously or directly tests distance/dissimilarity (beta diversities). This chapter focuses on introducing how to use the classical univariate linear mixed-effects models (LMMs) to analyze microbiome data. First, we describe linear mixed-effects models (LMMs) (Sect. 15.1), then we introduce how to implement LMMs to identify the significant taxa using the nlme package (Sect. 15.2) and model the diversity indices using the lme4 and LmerTest packages (Sect. 15.3), respectively. In Sect. 15.4, we introduce how to fit LMMs using QIIME 2. In Sect. 15.5, we provide some remarks on longitudinal microbiome studies based on LMMs. Finally, we summarize the topics of this chapter (Sect. 15.6).

15.1 Introduction to Linear Mixed-Effects Models (LMMs)

In this section, we introduce the advantages of LMMs, the development of fixed and random effects model and definition of LMMs, as well as how to conduct statistical hypothesis testing and fit LMMs.

15.1.1 Advantages and Disadvantages of LMMs

The distinctive characteristic of longitudinal study is that the subjects are measured repeatedly during the study, permitting directly assess the changes in response variable over time (Fitzmaurice et al. 2004; Diggle et al. 2002). Thus, a longitudinal study can capture between-individual differences (heterogeneity among individuals) and within-subject dynamics. Linear mixed-effects models (LMMs, aka multi-level modeling) are an important class of statistical models incorporating fixed and random effects. LMMs as a statistical method can be used to analyze correlated or non-independent data. Such data are often collected in the settings of longitudinal/repeated measurements in which the same statistical units were repeatedly measured or clustered observations where the clusters of related statistical units were measured. Such data also arise from a multilevel/hierarchical structure.

Because the circumstances of the measurements often cannot be fully controlled, considerable variation among individuals in the number and timing of observations may exist, hence resulting in unbalanced datasets. Typically, such unbalanced datasets are challenge to be analyzed using a general multivariate model with unrestricted covariance structure (Laird and Ware 1982). Moreover, missing values often arise in longitudinal/repeated measurements. Traditional statistical methods such as repeated measure analysis of variance are difficult to deal with missing values.

Mixed effects models often have advantages in dealing with unbalanced datasets and missing values.

LMMs integrate two (hierarchical) levels of observations of longitudinal data, i.e., within-group (subject) and between-group (subject) in a single model (Harville 1977; Arnau et al. 2010) and allows a variety of variance/covariance structures or correlation patterns to be explicitly modeled, which provides the opportunity to accurately capture the individual profile information over time. Typically the fixed effects are used to model the systematic mean patterns (i.e., treatment conditions), while random effects are used to model two types of random components: the correlation patterns between repeated measures within subjects or heterogeneities between subjects or both (Lee et al. 2006). That is, random effects estimate the variance of the response variable within and among these groups to reduce the probability of false positives (Type I error rates) and false negatives (Type II error rates) (Crawley 2012; Harrison et al. 2018).

In summary, the desirable features of LMMs are that they are not required for balance in the data and allow explicitly modeling and analysis of between- and within-individual variation (Laird and Ware 1982). Thus, LMMs have the capability of modeling variance and covariance (random effects) which makes this method more preferred to the classical linear model (Xia and Sun 2021). However, LMMs have the major limitation compared to the general multivariate model: a special form of the covariance structure should be assumed (Laird and Ware 1982).

15.1.2 Fixed and Random Effects

The core feature of LMMs is that they incorporate fixed and random effects and hence can analyze the fixed and random effects simultaneously. A fixed effect refers to a parameter that does not vary, whereas a random effect represents a parameter that is itself a random variable. In other words, the distribution of these parameters (or “random effects”) vary over individuals (Laird and Ware 1982).

LMMs have a long history of development. In 1918 Ronald Fisher introduced the random effects models to study the correlations of trait values between relatives such as the correlation of parent and child (Fisher 1918; Laird and Ware 1982). Since the 1930s, researchers have emphasized taking account of random effects as accurately as possible when estimating experimental treatment effects (fixed effects) (Thompson 2008). In the 1950s, Charles Roy Henderson provided best linear unbiased estimates (BLUE) of fixed effects and best linear unbiased predictions (BLUP) of random effects (Robinson 1991; Henderson et al. 1959; McLean et al. 1991). Subsequently, Goldberger (1962), Henderson (Henderson 1963, 1973, 1984; Henderson et al. 1959), and Harville (1976a, b) developed mixed model procedures. The mixed model equations (Henderson 1950; Henderson et al. 1959) introduced by Charles Roy Henderson offers the base for a methodology that provides flexibility of fitting models with various fixed and random elements with the possible assumption of correlation among random effects (McLean et al. 1991; Thompson 2008).

15.1.3 Definition of LMMs

In matrix form a linear mixed model can be specified as:

$y= X\beta + Zu+\varepsilon, \kern0.36em u\sim {N}_q\left(0,G\right),\kern0.48em \varepsilon \sim {N}_n\left(0,R\right),$

(15.1)

where y is a n × 1 column vector, the response (outcome) variable; X is a n × p design matrix of fixed-effects parameters (i.e., for the p predictor variables); β is a p × 1 column vector of the fixed-effects parameters representing regression coefficients (the βs); Z is the n × q design matrix for the q random effects; u is a q × 1 vector of q random effects (the random complement to the fixed β); and ε is a n × 1 column vector of the residuals, which is part of y that is not explained by the model, Xβ + Zu. u and ε are independent and R = σ²I.

G is the variance-covariance matrix of the random effects. Given the fixed effects are directly estimated, the random effects are just deviations around the value in β, which is the mean. Thus it the variance that is left to estimate. Various structures of the variance-covariance matrix of the random effects can be assumed and specified. If only a random intercept is specified, then G is just a 1 × 1 matrix, which models the variance of the random intercept. If a random intercept and a random slope are assumed, then we can specify:

$G=\left[\begin{array}{cc}{\sigma}_{\mathrm{int}}^2& {\sigma}_{\operatorname{int},\mathrm{slope}}^2\\ {}{\sigma}_{\operatorname{int},\mathrm{slope}}^2& {\sigma}_{\mathrm{slope}}^2\end{array}\right]$ . If we assume that the random effects are independent, then the true structure is $G=\left[\begin{array}{cc}{\sigma}_{\mathrm{int}}^2& 0\\ {}0& {\sigma}_{\mathrm{slope}}^2\end{array}\right]$ . In Chap. 16 (Sect. 16.2), we will introduce generalized linear mixed models (GLMMs).

15.1.4 Statistical Hypothesis Tests

In the statistical model building for experimental or observational data, LMMs can be used to serve two purposes (Bates 2005): (1) to characterize the dependence of a response (outcome variable) on covariate(s), conditioning on the treatment status and the time under treatment, and (2) to characterize the “unexplained” variation in the response.

A mixed-effects model (or more simply a mixed model) incorporates both fixed-effects terms and random-effects terms, which use a different way to model repeatable covariates and non-repeatable covariates (Bates 2005; Bates et al. 2015). The fixed-effects terms are used for a repeatable covariate to characterize the change in the response between different levels such as typically the change of the response over time under treatment or the difference of a response between treatment and the control groups. In contrast, the random-effects terms are used to characterize the variation induced in the response by the different levels of the non-repeatable covariate (a grouping factor).

Hypothesis Testing of the Fixed-Effects

To test a hypothesis on the fixed effects β, we can use likelihood-ratio test (LRT). The LRT can be constructed by specifying a nested (smaller) model with the same error structure as model (15.1).

$y={X}_0{\beta}_0+ Zu+\varepsilon .$

(15.2)

The LRT statistic for the test of the hypothesis is given:

$\begin{array}{l}{H}_0=\beta \in {\Theta}_{\beta_0},\\ {}{H}_a=\beta \in {\Theta}_{\beta },\end{array}$

where ${\Theta}_{\beta_0}$ is a subspace of the parameter space Θ_β of the fixed effects β. The log-likelihood ratio test statistic for the null hypothesisis is given by:

${\lambda}_{\mathrm{LRT}}=2\left( ll-{ll}_0\right),$

where ll and ll₀ are the log-likelihoods of the models in Eqs. (15.1) and (15.2), respectively. Under the hull hypothesis, λ_LRT follows asymptotically a χ² distribution. LRT is often used, but it tends to produce “anticonservative” P-values and sometimes quite badly so (Pinheiro and Bates 2000) (p. 88). Thus, an F test (although overall it does not exactly follow an F distribution) of the null hypothesis about the fixed effects β: H₀ : Lβ = 0, is considered in the literature with L is a contrast matrix of q = rank (L) > 1. The test statistic for this null hypothesis is given:

$F=\frac{{\left(L\hat{\beta}\right)}^T{\left(L\hat{C}{L}^T\right)}^{-1}\left(L\hat{\beta}\right)}{q},$

(15.3)

We will describe more details about the LRT and other tests in Chap. 16 (Sect. 16.5.4 for the LRT).

15.1.5 How to Fit LMMs

To fit a LMM, we typically should start with a full model, i.e., including all independent variables of interest and some interaction terms of main factors. Then we can evaluate parameters and compare sub-models. In literature, some protocols and procedures have been proposed and discussed to avoid common and particularly statistical problems related to mixed-effects modeling and multi-model inference in ecology (Zuur et al. 2010; Zuur and Ieno 2016; Harrison et al. 2018). Here, we describe some main points based on our experience using LMMs and the discussion in the literature.

First of all, exactly determining whether a variable is a random or fixed is controversial.

Specifying a particular variable as fixed or random effect or even both simultaneously largely depends on experimental design and context. Generally fixed effects refer to all factors whose levels are experimentally determined or whose interest is in the specific effects of each level, such as treatments, time, other covariates, as well as treatment and time (or covariates) interactions. In contrast, random effects refer to all factors that qualify as sampling from a population or whose interest is in the variation among the population rather than the specific effects of each level. Random effects typically represent some grouping variables in both subject and unit levels (Breslow and Clayton 1993), such as individuals in repeated measurements, field trials, plots, blocks, batches.

Random effect models have several desirable properties, but it is more challenging to use them. Harrison et al. (2018) very well described four considerations when fitting random effects. We summarized them here:

1.
Fitting random effects requires at least five “levels” (groups) for a random intercept term to achieve robust estimates of variance (Gelman and Hill 2007; Harrison 2015). The mixed model may not be able to estimate the among-group variance accurately for less than 5 levels (Harrison et al. (2018) because if the random effect models have less than 5 levels, then the variance estimate will either collapse to zero, making the model become an ordinary GLM (Gelman and Hill 2007) or be non-zero which is incorrect if the small number of groups that were sampled are not representative of true distribution of means (Harrison 2015). Thus, in practice, as a rule of thumb, the factors with fewer than 5 levels are considered as fixed effects, while the factors with numerous levels are considered as random effects in order to increase accurately estimating variance.
2.
Models and especially the random slope models will be unstable if sample sizes across groups are highly unbalanced (Grueber et al. 2011).
3.
It is difficult to determine the significance or importance of variance among groups.
4.
Mis-specification of random effects or incorrectly parameterizing the random effects in the model could results in the model estimates being unreliable as ignoring the need for random effects altogether (Harrison et al. 2018).

Second, start with a full model to include all potential independent variables in the fixed model using maximum likelihood (ML).

Depending on experimental design and hypothesis, include as many as possible predictors and their interactions in the fixed-effect model. A linear regression (ordinary least squares) model can be fitted to compare with the fixed effect only model. The fixed-effect model is fitted under the statistical framework of ML to find the parameters of a model that maximizes the probability of the observed data (the likelihood). It is suggested to use information criteria (e.g., AIC and/or BIC, see Chap. 16 for details) to compare the model fit and perform variable selection. Because LRT is unreliable for small to moderate sample sizes, it is not recommended for testing fixed effects in GLMMs (Pinheiro and Bates 2000). To test fixed effects or compare any models with different fixed effects using LRT, the model is required to be fitted with ML too. Based on the significant fixed effects using ML estimation, an optimal fixed model structure will be determined, including the important main effects and their interactions. For more details on model selection in GLMMs, please see Chaps. 16 and 17.

Third, fit the random model using restricted maximum likelihood (REML) to optimize the random structure.

The goal of this step is to find the optimal variance structure in terms of heterogeneity. Thus, it is required to fit the model with REML to estimate the random-effect parameters (i.e., standard deviations) over the averaged values of the fixed-effect parameters. Because REML assumes that the fixed effects are unknown, thus generally REML estimation is less biased compared to corresponding ML estimation (Veroniki et al. 2016; Zuur et al. 2009). An identical generalized linear model (GLM) with REML can be fitted to serve as a reference for the fitted LMMs. Based on the significant random effects using REML estimation, an optimal random (variance) model structure will be determined. That is, the random intercept, random slopes and crossed or nested effect can be specified as the random effects. An error structure can also be chosen.

Finally, refit the optimal fixed and random models using REML, and validate and interpret the modeling results.

After updating (refitting) optimal model with REML, we can check the differences in the main effects and effects in interaction terms caused by the introduction of random effects. A diagnosing model fit procedure can be conducted to validate the fitted models. The reader is referred to Chap. 17 for details. We then can interpret the results based on the prior hypotheses.

15.2 Identifying the Significant Taxa Using the nlme Package

In this section, we introduce LMMs in microbiome research and illustrate how to fit LMMs to identify significant microbial taxa.

15.2.1 Introduction to LMMs in Microbiome Research

The microbiome is inherently dynamic, driven by interactions with the host and the environment, and varies over time. Longitudinal microbiome data analysis can enhance our understanding of short- and long-term trends of microbiome by intervention, such as diet, and the development and persistence of chronic diseases caused by microbiome. Thus, longitudinal microbiome data analysis provides rich information on the profile of microbiome with host and environment interactions (Xia et al. 2018a).

In the literatures of longitudinal statistical analysis of physical, biological, psychological, social, and behavioral sciences, the linear mixed-effects models (LMMs) (Laird and Ware 1982) are often used. The classical LMM method was also one of the first models that were used to analyze longitudinal microbiome data (e.g., Rosa et al. 2014; Kostic et al. 2015; DiGiulio et al. 2015; Wang et al. 2015) and currently is still used in microbiome and multi-omics literature (Lloyd-Price et al. 2019). The reasons that LMMs have been chosen to analyze longitudinal microbiome data mainly because (1) LMMs equip with fixed- and random-effects components, which provides a standardized and flexible approach to model both fixed- and random-effects. (2) LMMs can jointly model measures over all the time points to account for time-dependent correlations in longitudinal microbiome study designs. Specifically because (3) LMM method was an established methodology to remove the effects of fixed- and random-effect confounding variables (Ernest et al. 2012; Xia 2020) in the areas of microarray, genome-wide association studies (GWAS), and metabolomics (Fabregat-Traver et al. 2014; Zhao et al. 2019), which motivates its application to microbiome data.

15.2.2 Longitudinal Microbiome Data Structure

Microbiome data (i.e., OTUs/ASVs table) generated by each sample through either the 16S rRNA gene sequencing or the shotgun metagenomics sequencing consist of read counts of taxa at certain taxonomic levels, such as kingdom, phylum, class, order, family, genus, and species. In longitudinal microbiome study, each subject are measured at multiple time points (i.e., samples). Assume that there are n subjects, and subject i is measured at n_i time points t_ij, j = 1, …, n_i and i = 1, …, n. Denote C_ijm as read counts for subject i measured at time j for taxon m, m = 1, …, m, and denote the total sequence read counts for subject i at time j as T_ij, which is referred to as library size or depths of coverage; then a longitudinal microbiome data structure can be summarized in Table 15.1.

In Table 15.1, X_i denotes clinical or environmental variables for each subject. Under this data structure and notations, we can formulate a longitudinal microbiome model to identify the associations between the microbiome counts and the host trait of interest (i.e., covariate variables) and characterize the time trends of microbiome abundance both within subjects and between subjects.

Table 15.1

Longitudinal microbiome data structure

Subject-ID	Taxon 1	Taxon 2	⋯	Taxon m	Total reads	Covariate	Time
Subject 1	C₁₁₁	C₁₁₂	…	C_11m	T₁₁	X₁	t₁₁
Subject 1	C₁₂₁	C₁₂₂	…	C_12m	T₁₂	X₁	t₁₂
Subject 1	C₁₃₁	C₁₃₂	…	C_13m	T₁₃	X₁	t₁₃
Subject 2	C₂₁₁	C₂₁₂	…	C_21m	T₂₁	X₂	t₂₁
Subject 2	C₂₂₁	C₂₂₂	…	C_22m	T₂₂	X₂	t₂₂
Subject 2	C₂₃₁	C₂₃₂	…	C_23m	T₂₃	X₂	t₂₃
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
Subject n	C_n11	C_n12	…	C_n1m	T_n1	X_n	T_n1

To analyze the associations between the microbiome and the host trait of interest (i.e., covariate variables), two approaches can be applied via LMMs: (1) fit LMMs using the read counts as the outcome and (2) fit LMMs using the diversity index as the outcome. The LMMs can be fitted using the nlme (current version 3.1-157, March 2022) and lme4 (current version 1.1-29, April 2022) packages. Below we illustrate their uses through our dataset of breast cancer QTRT1 mouse gut microbiome.

15.2.3 Fit LMMs Using the Read Counts as the Outcome

When applying LMM to fit microbiome and multi-omics abundance data, raw microbiome abundances should be transformed or normalized by an appropriate method (Xia 2020): (1) arcsine square root transformation (Kostic et al. 2015; Rosa et al. 2014), (2) log-transformation (with pseudocount 1 for zero values) (Lloyd-Price et al. 2019), and (3) log-transformation with no pseudocount for expression ratios (non-finite values removed) (Lloyd-Price et al. 2019).

It is arguable whether or not these transformation/normalization methods are appropriate to each outcomes of interest. Especially adding a pseudocount 1 for zero values may even not make sense because it forces the microbiomes from “nothing” (absence) to “being”(presence).

Example 15.1: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 9.1, Cont.

In this dataset, the mouse fecal samples were collected at two time points: pretreatment and posttreatment. Below we use LMMs to identify the significant change species over time.

> setwd("~/Documents/QIIME2R/Ch15_LMMs")

Step 1: Load taxa abundance data and meta data.

> otu_tab=read.csv("otu_table_L7_MCF7.csv",row.names=1,check.names=FALSE)

> meta_tab <- read.csv("metadata_QtRNA.csv",row.names=1,check.names = FALSE)

Step 2: Check taxa and meta data.

> # Check the numbers of taxa and samples

> nrow(otu_tab)#taxa

[1] 59

> ncol(otu_tab)#samples

[1] 40

> # Check the numbers of samples and variables

> nrow(meta_tab)#samples

[1] 40

> ncol(meta_tab)#variables

[1] 6

> head(otu_tab,3)

> head(meta_tab,3)

For the modeling of LMMs, the taxa need to be in the columns. We use the t() function to transform the otu_tab.

> otu_tab_t<-t(otu_tab)

> head(otu_tab_t,3)

> dim(otu_tab_t)

[1] 40 59

Step 3: Create analysis dataset-outcomes and covariates.

# Create outcome matrix or data frame

> colnames(otu_tab_t)

> colnames(meta_tab)

> class(otu_tab_t)

[1] "matrix" "array"

> class(meta_tab)

[1] "data.frame"

> outcome =otu_tab_t

> # Match sample IDs in outcome matrix or data frame and meta_tab data frame

> outcome = outcome[rownames(meta_tab), ]

> class(outcome)

[1] "matrix" "array"

> class(meta_tab)

[1] "data.frame"

> head(outcome,3)

> yy = as.matrix(outcome)

> yy = ifelse(is.na(yy), 0, yy) #Exclude missing outcome variables if any

> dim(yy)

[1] 40 59

There are 59 taxa in otu_tab. To effectively model the outcomes, we first reduce the number of taxa in the analysis.

In the literature, there are no unique procedures to filter abundance taxa data before modeling. In general, the unclassified taxa were removed, then either read counts or relative abundance data were used. Some researchers renormalized the data after the unclassified taxa were removed, while others directly used the data but adjusted the total read counts during the modeling. Some researchers consider low read counts contribute less to outcome results (although this argument underestimates the functional analysis or the role of rare taxa) and hence removed the low read counts based on a cutoff value. Because too few zeros and too many zeros in the abundance data may cause the models be inaccurate, most models used the proportion of zeros to filter abundance taxa data. Some even used double functions (sum() and quantile() functions) (Xia et al. 2018b) to filter the data. Actually, such arbitrary data preprocessing is an unapproved practice in microbiome research. We should recognize that it is a limitation of current microbiome research.

Here, we use sum() and quantile() functions to double filter the taxa that results in 17 taxa to be included in the analysis.

> # Filter abundance taxa data

> # Here we use double filters

> filter1 <- apply(yy, 2, function(x){sum(x>0)>0.4*length(x)})

> filter2 <- apply(yy, 2, function(x){quantile(x,0.9)>1})

> filter12 <-yy[,filter1&filter2]

> cat('after filter:','samples','taxa',dim(filter12),'\n')

after filter: samples taxa 40 17

> #cat(colnames(filter12),'\n')

> colnames(filter12)

[1] "D_6__Anaerotruncus sp. G3(2012)"

[2] "D_6__Clostridiales bacterium CIEAF 012"

[3] "D_6__Clostridiales bacterium CIEAF 015"

[4] "D_6__Clostridiales bacterium CIEAF 020"

[5] "D_6__Clostridiales bacterium VE202-09"

[6] "D_6__Clostridium sp. ASF502"

[7] "D_6__Enterorhabdus muris"

[8] "D_6__Eubacterium sp. 14-2"

[9] "D_6__gut metagenome"

[10] "D_6__Lachnospiraceae bacterium A4"

[11] "D_6__Lachnospiraceae bacterium COE1"

[12] "D_6__mouse gut metagenome"

[13] "D_6__Mucispirillum schaedleri ASF457"

[14] "D_6__Parabacteroides goldsteinii CL02T12C30"

[15] "D_6__Ruminiclostridium sp. KB18"

[16] "D_6__Staphylococcus sp. UAsDu23"

[17] "D_6__Streptococcus danieliae"

If the meta data are incomplete, we can use the complete.cases() function to subset the data and apply complete cases.

Now we can define outcome matrix or data frame and determine the final taxa in the analysis.

> # Match the number of taxa to the filter12 dataset in the analysis

> yy = filter12[rownames(meta_tab), ]

> dim(yy)

[1] 40 17

> # Create a matrix to hold the taxa

> taxa_all <- colnames(filter12)

> taxa_all

The following commands are used to specify covariates for the model.

> N = meta_tab[, "Total.Read"] # total reads

> mean(N); sd(N)

[1] 55619

[1] 7524

> mean(log(N)); sd(log(N))

[1] 10.92

[1] 0.1417

> subject = meta_tab[, "MouseID"]

> group = meta_tab[, "Group"]

> time = meta_tab[, "Time"]

Step 4: Run the lme() function in the nlme package to implement LMMs.

The lme () function is available in the R package nlme, which was developed to fit linear mixed models of the form described in Pinheiro and Bates (2000) (Pinheiro and Bates 2000, 2006).

We can assume independence of correlation matrix by setting R matrix as an identity matrix, which is the most simplified structure. However, various correlation matrices can be specified (Pinheiro and Bates 2000, 2006) such as autoregressive of order 1 [AR(1)] or continuous-time AR(1). The lme () function allows for multiple and correlated group-specific (random) effects (the argument random) and various types of within-group correlation structures (the argument correlation) described by corStruct in the nlme package.

Below we fit LMMs with main host factors: group and time as well as their interaction as fixed effects and random intercept only (Group*Time, random = ~ 1|Subject). Other LMMs also can be fitted such as with main host factors: group, time, and their interaction as fixed effects and random intercept only and specifying correlation matrix (Group*Time, random = ~ 1|Subject, correlation = corAR1()).

First we need to create list() and matrix to hold the objects of modeling results, which could be used as further processing the results and model comparisons. Then we need to call library(nlme) and run the lme() function iteratively to fit multiple taxa (outcome variables).

> yy <- yy[, colSums(is.na(yy)) == 0]

> # Create list() and the objects to hold the modeling results

> mod= list()

> rest1 = rest2 = rest3 = matrix(NA, ncol(yy), 1)

> library(nlme)

> for (j in 1:ncol(yy)){

+ tryCatch({

+ #filter12= na.omit(filter12)

+ y = yy[, j]

+ yt = asin(sqrt(y/N))

+ df = data.frame(y=y, Group=group, Time=time, Subject=subject)

+ mod= lme(yt ~ Group*Time, random = ~ 1|Subject,data=df)

+ rest_tab = summary(mod)$tTable

+ rest1[j, ] = summary(mod)$tTable[2, 5]

+ rest2[j, ] = summary(mod)$tTable[3, 5]

+ rest3[j, ] = summary(mod)$tTable[4, 5]

+ }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})

+ }

We can check the last outcome results using the object rest_tab and summary(mod).

> rest_tab

Value Std.Error DF t-value p-value

(Intercept) 0.0140953 0.002828 36 4.98473 1.574e-05

Group -0.0003452 0.003999 36 -0.08631 9.317e-01

TimePost 0.0043009 0.003999 36 1.07551 2.893e-01

Group:TimePost -0.0012754 0.005655 36 -0.22551 8.229e-01

> summary(mod)

Step 5: Adjust P-values.

> all_lmm <- as.data.frame(cbind(rest1, rest2,rest3))

> # Adjust p-values

> Group_lmm_adj <-round(p.adjust(rest1,'fdr'),4)

> TimePost_lmm_adj <-round(p.adjust(rest2,'fdr'),4)

> Group.TimePost_lmm_adj <-round(p.adjust(rest3,'fdr'),4)

> all_lmm_adj <- as.data.frame(cbind(taxa_all,Group_lmm_adj,TimePost_lmm_adj,Group.TimePost_lmm_adj))

Step 6: Write table to contain the results (Table 15.2).

Table 15.2

Modeling results from linear mixed effects model based on QtRNA data

	taxa_all	Group_Imm_adj	TimePost_Imm_adj	Group.TimePost_Imm_adj
1	D_6__Anaerotruncus sp. G3(2012)	0.5954	2E-04	0.8432
2	D_6__Clostridiales bacterium CIEAF 012	0.9589	0.6811	0.4371
3	D_6__Clostridiales bacterium CIEAF 015	0.4575	2E-04	0.3968
4	D_6__Clostridiales bacterium CIEAF 020	0.3038	0.1708	0.1857
5	D_6__Clostridiales bacterium VE202-09	0.9589	0.2281	0.4862
6	D_6__Clostridium sp. ASF502	0.8211	0.3766	0.5562
7	D_6__Enterorhabdus muris	0.1482	0.8121	0.3488
8	D_6__Eubacterium sp. 14-2	0.0654	6E-04	0.1857
9	D_6__gut metagenome	0.8211	0.8808	0.5562
10	D_6__Lachnospiraceae bacterium A4	0.8309	0.6208	0.4494
11	D_6__Lachnospiraceae bacterium COE1	0.1329	0.4969	0.0277
12	D_6__mouse gut metagenome	0.4186	3E-04	0.8432
13	D_6__Mucispirillum schaedleri ASF457	0.0644	0.7033	0.1857
14	D_6__Parabacteroides goldsteinii CL02T12C30	0.353	0.0026	0.9993
15	D_6__Ruminiclostridium sp. KB18	0.8211	0.0026	0.1857
16	D_6__Staphylococcus sp. UAsDu23	0.1482	6E-04	0.3249
17	D_6__Streptococcus danieliae	0.9589	0.4471	0.8743

> # Table 15.2

> # Write results table

> # Make the table

> library(xtable)

> table <- xtable(all_lmm_adj,caption = "Table of significant taxa",lable="sig_taxa_table")

> print.xtable(table,type="html",file = "LMMs_Table_Model_QtRNA.html")

> write.csv(all_lmm_adj ,file = paste("LMMs_Table_Model_QtRNA.csv",sep = ""))

15.3 Modeling the Diversity Indices Using the lme4 and LmerTest Packages

In this section, we introduce the lme4 and lmerTest packages and illustrate how to fit LMMs to analyze the diversity indices.

15.3.1 Introduction to the lme4 and lmerTest Packages

The lme4 package (Bates 2005; Bates et al. 2015) and its lmer() function were developed by Bates and his colleagues to provide more flexible fitting of linear mixed models and provide extensions to generalized linear mixed models as well. Compared to the lme () function, the lmer() function fits a larger range of models and is more reliable and faster, but the model specification has been changed slightly. For the details on how to use the lmer() function to fit linear mixed models, and the differences between lmer() and lme(), the interested reader is referred to the cited references for description of the methods (Bates 2005; Bates et al. 2015).

The R package lmerTest (Kuznetsova et al. 2017) (latest version 3.1-3, October 2020) was developed to facilitate the functionality of the “lmerMod” class of the lme4 package, including:

Obtain the P-values for the F-and t-tests for objects returned by the lmer() through overloading the anova() and summary() functions of fixed effects.
Implement the Satterthwaite’s method for approximating degrees of freedom for the t-and F-tests.
Implement the construction of Type I-III ANOVA tables.
Obtain the summary and the anova table using the Kenward-Roger approximation for denominator degrees of freedom by using the KRmodcomp() function in the pbkrtest package.
Perform backward elimination (a step-down model building method) of both random and fixed non-significant effects.
Calculate population means and perform multiple comparison tests as well as plot facilities.

The lme4 package provides F-statistics and the t-statistics in anova() and summary() functions, respectively; however, the P-values for the corresponding F- and t-tests are not provided. The lmerTest package implements and wraps Satterthwaite’s method (Giesbrecht and Burns 1985; Hrong-Tai Fai and Cornelius 1996) into anova() and summary() functions for the object returned by lmer(). Satterthwaite’s method has been implemented in SAS software (SAS Institute Inc. 1978, 2013). The SAS users should be familiar with Satterthwaite’s method. As an alternative, the Kenward-Roger approximation method is also available, which is integrated through the KRmodcomp() function of the pbkrtest package (Halekoh and Højsgaard 2014, 2021). For details, the interested reader is referred to the paper of lmerTest package (Kuznetsova et al. 2017).

15.3.2 Fit LMMs Using the Diversity Index as the Outcome

Example 15.2: Breast Cancer QTRT1 Mouse Gut Microbiome, Example 15.1 Cont.

In this subsection, we use the same dataset from Example 15.1 to assess the association between microbiome diversity and group status (Genotype) by setting the Chao 1 richness index as the outcome and the group and time as the fixed-effect terms.

We fit the LMMs using the R package lme4 (Bates et al. 2015) and perform the corresponding statistical test using the R package lmerTest (Kuznetsova et al. 2017).

For illustration, we select the Chao 1 richness index as the response variable and the group, time, and their interaction as the fixed-effect terms. We also consider group and group by time interaction as random terms. We use the following five steps to perform LMMs and illustrate the functionality of lme4 and LmerTest packages.

The first two steps have done in Chap. 9 (Example 9.1), we duplicate them here for the reader’s convenience.

Step 1: Load taxa abundance data and meta data.

> setwd("~/Documents/QIIME2R/Ch15_LMMs")

> otu_tab <- read.csv("otu_table_L7_MCF7_amp.csv", check.names = FALSE)

> meta_tab <- read.csv("metadata_QtRNA_amp.csv", check.names = FALSE)

Step 2: Calculate alpha-diversity indices.

Here we choose Chao 1 richness index as response variable. Similarly, other diversity indices can be applied. Below we use the ampvis2 package to calculate Chao 1 richness index. It can be calculated using other R packages such as vegan and microbiome packages.

> # Combine otu-table and meta-table using the amp_load() function

> library(ggplot2)

> library(ampvis2)

> ds <- amp_load(otutable = otu_tab, metadata = meta_tab)

We can calculate all the alpha-diversity indices including Chao 1 and ACE by specifying richness=TRUE that are available in the ampvis2 package.

> options(width=110,digits=4)

> alpha_amp <- amp_alphadiv(ds,

+ measure =,

+ richness = TRUE,

+ rarefy = NULL

+ )

> head(alpha_amp,3)

SampleID Sample MouseID Genotype Group Time Group4 Total_Read Reads ObservedOTUs Shannon Simpson invSimpson Chao1 ACE

Sun040.BH5 Sun040.BH5 40 BH5 KO 1 Before KO_BEFORE 39824 39813 110 2.385 0.8270 5.779 141.0 156.1

Sun039.BH4 Sun039.BH4 39 BH4 KO 1 Before KO_BEFORE 42068 42056 112 2.378 0.8336 6.009 160.0 170.5

Sun027.BF2 Sun027.BF2 27 BF2 WT 0 Before WT_BEFORE 42738 42729 101 2.276 0.7784 4.512 153.8 162.9

Step 3: Run LMMs using the ANOVA method for objects returned by lmer() function.

> install.packages("lmerTest")

> library(lme4)

> library("lmerTest")

> chao1 <- lmer(Chao1 ~ Group*Time + (1 |Group), data = alpha_amp)

The lmerTest package provides Type I, II, and III ANOVA tables as defined in the SAS software (SAS Institute Inc. 1978). The Type II and III tables do not depend on the order in which the effects are entered in the model, whereas the Type I ANOVA table is order dependent which performs the sequential decomposition of the contributions of the fixed-effects. The Type I table is the one produced by the ANOVA method of the lme4 package (Bates et al. 2015; Kuznetsova et al. 2017).

For balanced cases, these three ANOVA tables give same results. By default the lmerTest package provides the Type III ANOVA table, while lme4 provides the sequential (Type I) ANOVA table.

> anova(chao1)

Type III Analysis of Variance Table with Satterthwaite's method

Sum Sq Mean Sq NumDF DenDF F value Pr(>F)

Group 405 405 1 36 0.56 0.458

Time 592 592 1 36 0.82 0.370

Group:Time 2376 2376 1 36 3.30 0.077 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The columns DenDF and Pr(>F) refer to denominator degrees of freedom and P-values, which are calculated using the Satterthwaite’s method of approximation. Based on the P-values, the Chao 1 richness index for Group and Time interaction is marginally significant (P = 0.077).

> anova(chao1, type = 1)

Type I Analysis of Variance Table with Satterthwaite's method

Sum Sq Mean Sq NumDF DenDF F value Pr(>F)

Group 405 405 1 36 0.56 0.4579

Time 6916 6916 1 36 9.61 0.0037 **

Group:Time 2376 2376 1 36 3.30 0.0775 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The Time variable is statistically significant (P = 0.0037) and the Group by Time interaction is marginally statistically significant (P = 0.0775) from the Type I ANOVA table with Satterthwaite’s approximation.

We can require another type of ANOVA by changing the type argument and Kenward-Roger’s method for calculating the F-test. The following commands require the Type I ANOVA table with Kenward-Roger’s approximation.

> anova(chao1, type = 1, ddf = "Kenward-Roger")

Type I Analysis of Variance Table with Kenward-Roger's method

Sum Sq Mean Sq NumDF DenDF F value Pr(>F)

Group 405 405 1 60.8 0.56 0.4559

Time 6916 6916 1 36.0 9.61 0.0037 **

Group:Time 2376 2376 1 36.0 3.30 0.0775 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Kenward-Roger’s method achieves the same results.

Step 4: Run LMMs using the summary method for objects returned by the lmer() function.

> summary(chao1)

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']

Formula: Chao1 ~ Group * Time + (1 | Group)

Data: alpha_amp

REML criterion at convergence: 348.2

Scaled residuals:

Min 1Q Median 3Q Max

-1.758 -0.710 -0.276 0.749 2.235

Random effects:

Groups Name Variance Std.Dev.

Group (Intercept) 10.8 3.28

Residual 719.4 26.82

Number of obs: 40, groups: Group, 2

Fixed effects:

Estimate Std. Error df t value Pr(>|t|)

(Intercept) 170.23 9.09 36.00 18.72 <2e-16 ***

Group -8.16 12.86 36.00 -0.63 0.530

TimePost 10.88 11.99 36.00 0.91 0.370

Group:TimePost 30.83 16.96 36.00 1.82 0.077 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:

(Intr) Group TimPst

Group -0.707

TimePost -0.660 0.466

Group:TmPst 0.466 -0.660 -0.707

optimizer (nloptwrap) convergence code: 0 (OK)

unable to evaluate scaled gradient

Hessian is numerically singular: parameters are not uniquely determined

In the output, additional columns of df and Pr(>|t|) in the fixed effects are added to the lme4 package. The df refers to degrees of freedom based on Satterthwaite’s approximation and Pr(>|t|) is the P-value for the t-test with df as degrees of freedom.

The following commands require the Kenward-Roger’s approximation.

> summary(chao1, dff="Kenward-Roger")

Step 5: Perform backward elimination using the step method for objects returned by the lmer() function.

To illustrate the step method for backward elimination, here we consider a full model including Group, Time, and their interaction as fixed-effect terms, as well as Group and Group by Time interaction as random-effect terms.

> chao1a <- lmer(Chao1 ~ Group*Time + (1 | Group)+ (1 | Group:Time), data = alpha_amp)

> b_step <- step(chao1a)

> b_step

Backward reduced random-effect table:

Eliminated npar logLik AIC LRT Df Pr(>Chisq)

<none> 6 -174 360

(1 | Group:Time) 1 5 -174 358 5.68e-14 1 1

Backward reduced fixed-effect table:

Eliminated Df Sum of Sq RSS AIC F value Pr(>F)

Group:Time 1 1 2376 28273 268 3.30 0.0775 .

Group 2 1 526 28800 267 0.69 0.4119

Time 0 1 6916 35716 274 9.13 0.0045 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model found:

Chao1 ~ Time

The reduced model showed that Group and Time interaction in fixed-effect terms is marginally significant (P = 0.0775) and Time is significant (P = 0.0045).

15.4 Implement LMMs Using QIIME 2

Example 15.3: Mouse Gut Microbiome Data, Example 9.2, Cont.

In Chap. 9 (Sect. 9.5.4), we used this study to illustrate how to obtain core alpha diversities via QIIME 2 and then we used pairwise Kruskal-Wallis test to compare Sex by Time variables interaction. There are total 349 samples (208 for early and 141 for late times). The results showed that later time has larger Shannon diversity than early time with P-value of 0.003227 and Q-value of 0.003227. In Chap. 9, we commented that Kruskal-Wallis test does not consider the dependency of the individuals/subjects and a longitudinal model is more appropriate. Here we use QIIME 2 to implement LMMs to test whether Shannon diversity index changed over time and had gender difference in this study.

15.4.1 Introduction to the QIIME Longitudinal Linear-Mixed-Effects Command

Linear mixed-effects models (LMMs) are more flexible in modeling the association between a single response variable and one or more independent variables in a longitudinal or repeated-measures design. The qiime longitudinal linear mixed-effects perform LMMs to evaluate the covariates effects of “group_columns” and “random_effects” to a single dependent variable (“metric”) and plot line plots of each group column. To implement LMMs, at least one numeric state-column (e.g., time variable) and one or more comma-separated group-columns need to be specified as the fixed effects, i.e., independent variables. These fixed effects may be categorical or numeric metadata columns. After implementing this model, regression plots of the response variable (“metric”) are plotted as a function of the state column and each group column.

15.4.2 Fit LMMs in QIIME 2

First, we need activate QIIME 2. Open the terminal and type: source activate qiime2-2022.2, which is latest version of QIIME 2. When the terminal appears “(qiime2-2022.2),” it indicates qiime2-2022.2 version has been activated. We can work with this version now.

We already create a folder LMM in the path:QIIME2-Biostatistics/longitudinal/LMM, so we can access to this folder using cd QIIME2R-Biostatistics/longitudinal/LMM. If this folder did not exist in your computer, you can create one: mkdir QIIME2R-Biostatistics/longitudinal/LMM. The following commands perform LMMs by specifying Sex and Time as fixed-effects terms, but not requiring the random-effects (Fig. 15.1 for Sex effects and Fig. 15.2 for Time effects).

Fig. 15.1
Regression scatterplots of Shannon diversity by male and female mice

Fig. 15.2
Regression scatterplots of Shannon diversity by early and later times

# Figures 15.1 and 15.2

qiime longitudinal linear-mixed-effects \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv\

--m-metadata-file shannon_vector.qza\

--p-metric shannon\

--p-group-columns Sex,Time\

--p-state-column DPW\

--p-individual-id-column StudyID\

--o-visualization LMMEffectsMiSeq_SOP.qzv

where the four columns are required: (1) sample metadata (here, SampleMetadataMiSeq_SOP.tsv) which contains individual-id-column; (2) state-column parameter: metadata column containing state (time) variable (here, DPW); (3) individual-id-column parameter, metadata column containing study IDs for individual subjects (here, StudyID), which indicates the individual subject/site that was sampled repeatedly; and (4) the o-visualization output column.

The p-metric is used to specify the response (dependent) variable column name, which must be located in the metadata or feature table files if the feature table is provided as input data. The p-group columns is used to specify the metadata columns as the independent covariates that are used to determine mean structure of “metric.” Several fixed-effects variables can be specified via a comma-separated list. The p-random-effects parameter can be added. The random-effects metadata columns are used to specify the independent covariates that are used to determine the variance and covariance structure (random effects) of “metric.” A random intercept for each individual is set by default, while to specify a random slope, set the variable in the p-state-column, in which the state-column value is passed as input to the random-effects parameter. Same to fixed effects, to specify several random effects, a comma-separated list of random effects can be provided in the p-random-effects column. The o-visualization output is used to name VISUALIZATION outputs. We here named it as LMMEffectsMiSeq_SOP.qzv.

By reviewing LMMEffectsMiSeq_SOP.qzv via qiime2 view, we can see that there were no statistically significant effects of sex (P = 0.088) or sampling period (early vs. late, P = 0.356) on Shannon diversity index, which confirmed the results that were originally reported in Schloss et al. (2012).

The qiime longitudinal linear-mixed-effects command returns several results, including (1) the input parameters at the top of the visualization, (2) the Model summary of descriptive information about the LMMs, (3) the main Model results, which summarizes the effects of each fixed effect (and their interactions) on the dependent variable (here, Shannon diversity). This table shows parameter estimates, including standard errors, z scores, P-values (P>|z|), and 95% confidence interval upper and lower bounds for each parameter, (4) the Regression scatterplots that categorized by each “group column” at the bottom of the visualization, with linear regression lines and 95% confidence interval in gray for each group, and (5) Projected Residuals, the scatterplots of fit vs. residual plots. The plots show the relationship between metric predictions for each sample (on the x-axis), and the residual or observation error (prediction - actual value) for each sample (on the y-axis). They are used for diagnostics. The roughly zero-centered residuals suggest a well-fitted model.

Below we add random-effects parameters and specify DPW as a random effect. For details of how specifying random effects, the reader is referred to Sect. 15.1.5 How to Fit LMMs.

qiime longitudinal linear-mixed-effects \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv\

--m-metadata-file shannon_vector.qza\

--p-metric shannon\

--p-group-columns Sex,Time\

--p-random-effects DPW\

--p-state-column DPW\

--p-individual-id-column StudyID\

--o-visualization LMMEffectsMiSeq_SOP_Random.qzv

The above regression scatterplots in Figs. 15.1 and 15.2 just provide a quick summary information of the data. To interactively plot the longitudinal data, QIIME 2 recommends using the volatility plot.

15.4.3 Perform Volatility Analysis

Volatility analysis generates interactive line plots to visualize how a dependent variable is volatile over a continuous, independent variable (e.g., time) in group(s). Thus, a volatility plot is a good qualitative tool to identify outliers that disproportionately drive the variance within individuals and groups.

The input data can be metadata files (e.g., alpha and beta diversity artifacts) and FeatureTable[RelativeFrequency] tables. Different dependent variables can be plotted on the y-axis. The following commands examine how variance in Shannon diversity and Sex changes over DPW (day post weaning) (specified in the state-column) (Fig. 15.3).

# Figure 15.3

qiime longitudinal volatility \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv\

--m-metadata-file shannon_vector.qza\

--p-default-metric shannon \

--p-default-group-column Sex\

--p-state-column DPW\

--p-individual-id-column StudyID \

--o-visualization VolatilitySexLMMEffectsMiSeq_SOP.qzv

Fig. 15.3
Volatility plots of Shannon diversity by male and female mice

The following commands examine how variance in Shannon diversity and Time (early and later periods) changes over DPW (day post weaning) (specified in the state-column) (Fig. 15.4).

Fig. 15.4
Volatility plots of Shannon diversities by early and later times

# Figure 15.4

qiime longitudinal volatility \

--m-metadata-file SampleMetadataMiSeq_SOP.tsv\

--m-metadata-file shannon_vector.qza\

--p-default-metric shannon \

--p-default-group-column Time\

--p-state-column DPW\

--p-individual-id-column StudyID \

--o-visualization VolatilityTimeLMMEffectsMiSeq_SOP.qzv

By reviewing the resulted output visualization using QIIME 2 view, the plot displays a line plot on the left-hand side of and a panel of “Plot Controls” to the right-hand side.

The “Plot Controls” panel is used to interactively adjust variables and parameters for determining how “groups” and “individuals” values change across the specified state-column (a single independent variable). For details on how use the interactive features, the reader is referred to QIIME 2 documents for further references or practicing the output visualization in QIIME 2 view.

15.5 Remarks on LMMs

The development of new methods for analysis of microbiome data often takes the advantage of LMM framework such as in Zhang and Yi (2020) and Hui et al. (2017). However, LMM is not specifically designed for microbiome data, and hence it does not address issues due to characteristics of microbiome data. LMM method is often criticized for using to analyze microbiome data because (1) it needs to transform absolute abundance read account into the relative abundance via a transformation method (e.g., arcsine square root transformation) before fitting the model, and most often treats transformed microbiome abundance as normally distributed responses (such as in Srinivas et al. 2013; Rosa et al. 2014; Leamy et al. 2014; Wang et al. 2015); (2) it does not explicitly handle the excess zeros in the data (Chen and Li 2016), and thus cannot fit the model with zero-inflation and over-dispersion to address the sparsity issue (Zhang and Yi 2020). Thus, directly applying LMM method to analyze microbiome data may be underpowered and have potential bias to identify the dynamic microbiome effects.

Several proposed longitudinal microbiome models have compared their methods to LMM method (Chen and Li 2016; Zhang et al. 2018; Zhang and Yi 2020).

15.6 Summary

This chapter focused on LMMs to microbiome data analysis. Section 15.1 covered the general topics of LMMs including the advantages and disadvantages of using LMMs, definitions of fixed and random effects, and its formulation, statistical hypothesis testing, and the procedures of fitting LMMs. Section 15.2 described how to identify the significant taxa with the nlme package. A general introduction to LMMs in microbiome research and longitudinal microbiome data structure were also described. Section 15.3 introduced the lme4 and LmerTest packages and how to use these two packages to analyze the diversity indices. Section 15.4 described how to implement LMMs in QIIME 2, including introduction to the QIIME longitudinal linear-mixed-effects command and volatility analysis. Section 15.5 provided some general remarks on LMMs in microbiome data.

Chapter 16 will describe the generalized linear mixed models (GLMMs), and Chap. 17 will introduce GLMMs and zero-inflated GLMMs for longitudinal microbiome data.

References

Arnau, Jaume, Roser Bono, Nekane Balluerka, and Arantxa Gorostiaga. 2010. General linear mixed model for analysing longitudinal data in developmental research. Perceptual and Motor Skills 110 (2): 547–566.CrossrefPubMed
Bates, Douglas. 2005. Fitting linear mixed models in R. R News 5 (1): 27–30.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67 (1): 48. https://doi.org/10.18637/jss.v067.i01. https://www.jstatsoft.org/v067/i01.Crossref
Breslow, Norman E., and David G. Clayton. 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88 (421): 9–25.
Chen, E.Z., and H. Li. 2016. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 32 (17): 2611–2617. https://doi.org/10.1093/bioinformatics/btw308.CrossrefPubMedPubMedCentral
Crawley, Michael J. 2012. The R book. 2nd ed. Wiley.Crossref
Diggle, P.J., P. Heagerty, K.-Y. Liang, and S.L. Zeger. 2002. Analysis of longitudinal data. 2nd ed. Oxford: Oxford University Press.
DiGiulio, D.B., B.J. Callahan, P.J. McMurdie, E.K. Costello, D.J. Lyell, A. Robaczewska, C.L. Sun, D.S. Goltsman, R.J. Wong, G. Shaw, D.K. Stevenson, S.P. Holmes, and D.A. Relman. 2015. Temporal and spatial variation of the human microbiota during pregnancy. Proceedings of the National Academy of Sciences of the United States of America 112 (35): 11060–11065. https://doi.org/10.1073/pnas.1502875112.CrossrefPubMedPubMedCentral
Ernest, Ben, Jessica R. Gooding, Shawn R. Campagna, Arnold M. Saxton, and Brynn H. Voy. 2012. MetabR: An R script for linear model analysis of quantitative metabolomic data. BMC Research Notes 5: 596–596. https://doi.org/10.1186/1756-0500-5-596. https://www.ncbi.nlm.nih.gov/pubmed/23111096. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3532230/.CrossrefPubMedPubMedCentral
Fabregat-Traver, Diego, Sodbo Zh. Sharapov, Caroline Hayward, Igor Rudan, Harry Campbell, Yurii Aulchenko, and Paolo Bientinesi. 2014. High-performance mixed models based genome-wide association analysis with omicABEL software. F1000Research 3: 200–200. https://doi.org/10.12688/f1000research.4867.1. https://www.ncbi.nlm.nih.gov/pubmed/25717363. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4329600/.CrossrefPubMedPubMedCentral
Fisher, R.A. 1918. The correlation between relatives on the supposition of mendelian inheritance. Transactions of the Royal Society of Edinburgh 52 (02): 399–433.Crossref
Fitzmaurice, G.M., N.M. Laird, and J.H. Ware. 2004. Applied longitudinal analysis. Hoboken: Wiley.
Gelman, Andrew, and Jennifer Hill. 2007. Data analysis using regression and hierarchical/multilevel models. New York: Cambridge University Press.
Giesbrecht, F.G., and J.C. Burns. 1985. Two-stage analysis based on a mixed model: Large-sample asymptotic theory and small-sample simulation results. Biometrics 41 (2): 477–486. https://doi.org/10.2307/2530872. http://www.jstor.org.proxy.cc.uic.edu/stable/2530872.Crossref
Goldberger, Arthur S. 1962. Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association 57 (298): 369–375.Crossref
Grueber, C.E., S. Nakagawa, R.J. Laws, and I.G. Jamieson. 2011. Multimodel inference in ecology and evolution: Challenges and solutions. Journal of Evolutionary Biology 24 (4): 699–711. https://doi.org/10.1111/j.1420-9101.2010.02210.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1420-9101.2010.02210.x.CrossrefPubMed
Halekoh, Ulrich, and Søren Højsgaard. 2014. A kenward-roger approximation and parametric bootstrap methods for tests in linear mixed models–The R package pbkrtest. Journal of Statistical Software 59 (9): 1–30.Crossref
———. 2021. pbkrtest: Parametric bootstrap, kenward-roger and satterthwaite based methods for test in mixed models. Last Modified 2021-03-09. https://people.math.aau.dk/~sorenh/software/pbkrtest/.
Harrison, Xavier A. 2015. A comparison of observation-level random effect and Beta-Binomial models for modelling overdispersion in Binomial data in ecology & evolution. PeerJ 3: e1114. https://doi.org/10.7717/peerj.1114.CrossrefPubMedPubMedCentral
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily E.D. Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6: e4794–e4794. https://doi.org/10.7717/peerj.4794. https://pubmed.ncbi.nlm.nih.gov/29844961. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/.CrossrefPubMedPubMedCentral
Harville, David. 1976a. Extension of the Gauss-Markov theorem to include the estimation of random effects. The Annals of Statistics 4: 384–395.Crossref
Harville, David A. 1976b. Confidence intervals and sets for linear combinations of fixed and random effects. Vol. 32, 403–407. Biometrics.
———. 1977. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72 (358): 320–338.Crossref
Henderson, Charles R. 1950. Estimation of genetic parameters. Biometrics 6 (2): 186–187.
———. 1963. Selection Index and Expected Genetic Advance. In: Statistical Genetics and Plant Breeding, National Academy of Sciences, No. 982, National Research Council Publication, Washington DC, 141–163.
———. 1973. Sire evaluation and genetic trends. Journal of Animal Science 1973 (Symposium): 10–41.Crossref
Henderson, C.R. 1984. Applications of linear models in animal breedlinig. Guelph: University of Guelph.
Henderson, C.R., S.R. Oscar Kempthorne, and Searle, and C. M. von Krosigk. 1959. The estimation of environmental and genetic trends from records subject to culling. Biometrics 15 (2): 192–218. https://doi.org/10.2307/2527669. http://www.jstor.org/stable/2527669.Crossref
Hrong-Tai Fai, Alex, and Paul L. Cornelius. 1996. Approximate F-tests of multiple degree of freedom hypotheses in generalized least squares analyses of unbalanced split-plot experiments. Journal of Statistical Computation and Simulation 54 (4): 363–378. https://doi.org/10.1080/00949659608811740.Crossref
Hui, Francis K.C., Samuel Müller, and A.H. Welsh. 2017. Joint selection in mixed models using regularized PQL. Journal of the American Statistical Association 112 (519): 1323–1333. https://doi.org/10.1080/01621459.2016.1215989.Crossref
Kostic, Aleksandar D., Dirk Gevers, Heli Siljander, Tommi Vatanen, Tuulia Hyötyläinen, Anu-Maaria Hämäläinen, Aleksandr Peet, Vallo Tillmann, Päivi Pöhö, Ismo Mattila, Harri Lähdesmäki, Eric A. Franzosa, Outi Vaarala, Marcus de Goffau, Hermie Harmsen, Jorma Ilonen, Suvi M. Virtanen, Clary B. Clish, Matej Orešič, Curtis Huttenhower, Mikael Knip, Diabimmune Study Group, and Ramnik J. Xavier. 2015. The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes. Cell Host & Microbe 17 (2): 260–273. https://doi.org/10.1016/j.chom.2015.01.001. https://pubmed.ncbi.nlm.nih.gov/25662751. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4689191/.Crossref
Kuznetsova, Alexandra, Per B. Brockhoff, and Rune H.B. Christensen. 2017. ImerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82 (13): 26. https://doi.org/10.18637/jss.v082.i13. https://www.jstatsoft.org/v082/i13.Crossref
Laird, N.M., and J.H. Ware. 1982. Random-effects models for longitudinal data. Biometrics 38 (4): 963–974.CrossrefPubMed
Leamy, Larry J., Scott A. Kelly, Joseph Nietfeldt, Ryan M. Legge, Fangrui Ma, Kunjie Hua, Rohita Sinha, Daniel A. Peterson, Jens Walter, Andrew K. Benson, and Daniel Pomp. 2014. Host genetics and diet, but not immunoglobulin A expression, converge to shape compositional features of the gut microbiome in an advanced intercross population of mice. Genome Biology 15 (12): 552–552. https://doi.org/10.1186/s13059-014-0552-6. https://pubmed.ncbi.nlm.nih.gov/25516416. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4290092/.CrossrefPubMedPubMedCentral
Lee, Y., J.A. Nelder, and Y. Pawitan. 2006. Generalised linear models with random effects. London: Chapman and Hall.Crossref
Lloyd-Price, Jason, Cesar Arze, Ashwin N. Ananthakrishnan, Melanie Schirmer, Julian Avila-Pacheco, Tiffany W. Poon, Elizabeth Andrews, Nadim J. Ajami, Kevin S. Bonham, Colin J. Brislawn, David Casero, Holly Courtney, Antonio Gonzalez, Thomas G. Graeber, A. Brantley Hall, Kathleen Lake, Carol J. Landers, Himel Mallick, Damian R. Plichta, Mahadev Prasad, Gholamali Rahnavard, Jenny Sauk, Dmitry Shungin, Yoshiki Vázquez-Baeza, Richard A. White, Jason Bishai, Kevin Bullock, Amy Deik, Courtney Dennis, Jess L. Kaplan, Hamed Khalili, Lauren J. McIver, Christopher J. Moran, Long Nguyen, Kerry A. Pierce, Randall Schwager, Alexandra Sirota-Madi, Betsy W. Stevens, William Tan, Johanna J. ten Hoeve, George Weingart, Robin G. Wilson, Vijay Yajnik, Jonathan Braun, Lee A. Denson, Janet K. Jansson, Rob Knight, Subra Kugathasan, Dermot P.B. McGovern, Joseph F. Petrosino, Thaddeus S. Stappenbeck, Harland S. Winter, Clary B. Clish, Eric A. Franzosa, Hera Vlamakis, Ramnik J. Xavier, Curtis Huttenhower, and Ibdmdb Investigators. 2019. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569 (7758): 655–662. https://doi.org/10.1038/s41586-019-1237-9.CrossrefPubMedPubMedCentral
McLean, Robert A., William L. Sanders, and Walter W. Stroup. 1991. A unified approach to mixed linear models. The American Statistician 45 (1): 54–64. https://doi.org/10.2307/2685241. http://www.jstor.org/stable/2685241.Crossref
Pinheiro, J., and D. Bates. 2000. Mixed-effects models in S and S-PLUS. New York: Springer.Crossref
Pinheiro, José, and Douglas Bates. 2006. Mixed-effects models in S and S-PLUS. Springer.
Robinson, G.K. 1991. That BLUP is a good thing: The estimation of random effects. Statistical Science 6 (1): 15–32. http://www.jstor.org/stable/2245695.
Rosa, La, S. Patricio, Barbara B. Warner, Yanjiao Zhou, George M. Weinstock, Erica Sodergren, Carla M. Hall-Moore, Harold J. Stevens, William E. Bennett, Nurmohammad Shaikh, Laura A. Linneman, Julie A. Hoffmann, Aaron Hamvas, Elena Deych, Berkley A. Shands, William D. Shannon, and Phillip I. Tarr. 2014. Patterned progression of bacterial populations in the premature infant gut. Proceedings of the National Academy of Sciences 111 (34): 12522–12527. https://doi.org/10.1073/pnas.1409497111. https://www.pnas.org/content/pnas/111/34/12522.full.pdf.Crossref
SAS Institute Inc. 1978. Tests of hypotheses in fixed-effects linear models: SAS technical report R-101. SAS Institute Inc.
———. 2013. The SAS system, version 9.4. SAS Inc. http://www.sas.com/.
Schloss, Patrick D., Alyxandria M. Schubert, Joseph P. Zackular, Kathryn D. Iverson, Vincent B. Young, and Joseph F. Petrosino. 2012. Stabilization of the murine gut microbiome following weaning. Gut Microbes 3 (4): 383–393.CrossrefPubMedPubMedCentral
Srinivas, Girish, Steffen Möller, Jun Wang, Sven Künzel, Detlef Zillikens, John F. Baines, and Saleh M. Ibrahim. 2013. Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications 4 (1): 2462. https://doi.org/10.1038/ncomms3462.CrossrefPubMed
Thompson, Robin. 2008. Estimation of quantitative genetic parameters. Proceedings of the Royal Society B: Biological Sciences 275 (1635): 679–686.CrossrefPubMedCentral
Veroniki, Areti Angeliki, Dan Jackson, Wolfgang Viechtbauer, Ralf Bender, Jack Bowden, Guido Knapp, Oliver Kuss, Julian P.T. Higgins, Dean Langan, and Georgia Salanti. 2016. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods 7 (1): 55–79. https://doi.org/10.1002/jrsm.1164. https://pubmed.ncbi.nlm.nih.gov/26332144. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4950030/.CrossrefPubMed
Wang, Jun, Shirin Kalyan, Natalie Steck, Leslie M. Turner, Bettina Harr, Sven Künzel, Marie Vallier, Robert Häsler, Andre Franke, Hans-Heinrich Oberg, Saleh M. Ibrahim, Guntram A. Grassl, Dieter Kabelitz, and John F. Baines. 2015. Analysis of intestinal microbiota in hybrid house mice reveals evolutionary divergence in a vertebrate hologenome. Nature Communications 6 (1): 6440. https://doi.org/10.1038/ncomms7440.CrossrefPubMed
Xia, Y. 2020. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 171: 309–491. https://doi.org/10.1016/bs.pmbts.2020.04.003.CrossrefPubMed
Xia, Yinglin, and Jun Sun. 2021. Longitudinal methods for analysis of microbiome data. In Microbiome & Metabolomics: Statistical Data Analyses, ACS in Focus. American Chemical Society.
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018a. Introductory overview of statistical analysis of microbiome data. Statistical Analysis of Microbiome Data with R: 43–75.
———. 2018b. Modeling zero-inflated microbiome data. In Statistical analysis of microbiome data with R, ed. Yinglin Xia, Jun Sun, and Ding-Geng Chen, 453–496. Singapore: Springer.Crossref
Zhang, X., and N. Yi. 2020. Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz973.
Zhang, Xinyan, Yu-Fang Pei, Lei Zhang, Boyi Guo, Amanda H. Pendegraft, Wenzhuo Zhuang, and Nengjun Yi. 2018. Negative binomial mixed models for analyzing longitudinal microbiome data. Frontiers in Microbiology 9 (1683). https://doi.org/10.3389/fmicb.2018.01683. https://www.frontiersin.org/article/10.3389/fmicb.2018.01683.
Zhao, Xueheng, Liang Niu, Carlo Clerici, Roberta Russo, Melissa Byrd, and Kenneth D.R. Setchell. 2019. Data analysis of MS-based clinical lipidomics studies with crossover design: A tutorial mini-review of statistical methods. Clinical Mass Spectrometry 13: 5–17. https://doi.org/10.1016/j.clinms.2019.05.002. http://www.sciencedirect.com/science/article/pii/S2376999818300497.CrossrefPubMedPubMedCentral
Zuur, Alain F., and Elena N. Ieno. 2016. A protocol for conducting and presenting results of regression-type analyses. Methods in Ecology and Evolution 7 (6): 636–645.Crossref
Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed effects models and extensions in ecology with R. Vol. 574. New York: Springer.Crossref
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1 (1): 3–14.Crossref